local view
- North America > United States > California (0.04)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (9 more...)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (9 more...)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
- North America > United States > California (0.04)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
SatDINO: A Deep Dive into Self-Supervised Pretraining for Remote Sensing
Self-supervised learning has emerged as a powerful tool for remote sensing, where large amounts of unlabeled data are available. In this work, we investigate the use of DINO, a contrastive self-supervised method, for pretraining on remote sensing imagery. W e introduce SatDINO, a model tailored for representation learning in satellite imagery. Through extensive experiments on multiple datasets in multiple testing setups, we demonstrate that SatDINO outperforms other state-of-the-art methods based on much more common masked autoencoders (MAE) and achieves competitive results in multiple benchmarks. W e also provide a rigorous ablation study evaluating SatDINO's individual components. Finally, we propose a few novel enhancements, such as a new way to incorporate ground sample distance (GSD) encoding and adaptive view sampling. These enhancements can be used independently on our SatDINO model. Our code and trained models are available at: https://github.com/strakaj/
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- North America > United States > California (0.04)
PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data
A BSTRACT We propose PyViT -FUSE, a foundation model for earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwA V algorithm. We show the interpretability of the fusion mechanism by visualization of the attention scores and the models applicability to downstream tasks. 1 I NTRODUCTION Foundation models (FM) for earth observations (EO) have gained traction following the success of large language models (LLM) and their demonstration of scaling laws (Kaplan et al., 2020). The premise is that training larger models on vast datasets enhances performance. This idea has been central to computer vision, where datasets like ImageNet (Deng et al., 2009) have enabled pre-training in both supervised and unsupervised settings, leading to breakthroughs in model design and training.
- North America > United States > Colorado (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- Europe > Italy > Marche > Ancona Province > Ancona (0.04)
- Asia > India > Karnataka > Bengaluru (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
GRANITE : a Byzantine-Resilient Dynamic Gossip Learning Framework
Belal, Yacine, Maouche, Mohamed, Mokhtar, Sonia Ben, Simonet-Boulogne, Anthony
Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighboring peers. Recent GL approaches rely on dynamic communication graphs built and maintained using Random Peer Sampling (RPS) protocols. Thanks to graph dynamics, GL can achieve fast convergence even over extremely sparse topologies. However, the robustness of GL over dy- namic graphs to Byzantine (model poisoning) attacks remains unaddressed especially when Byzantine nodes attack the RPS protocol to scale up model poisoning. We address this issue by introducing GRANITE, a framework for robust learning over sparse, dynamic graphs in the presence of a fraction of Byzantine nodes. GRANITE relies on two key components (i) a History-aware Byzantine-resilient Peer Sampling protocol (HaPS), which tracks previously encountered identifiers to reduce adversarial influence over time, and (ii) an Adaptive Probabilistic Threshold (APT), which leverages an estimate of Byzantine presence to set aggregation thresholds with formal guarantees. Empirical results confirm that GRANITE maintains convergence with up to 30% Byzantine nodes, improves learning speed via adaptive filtering of poisoned models and obtains these results in up to 9 times sparser graphs than dictated by current theory.
- Europe > France > Auvergne-Rhône-Alpes > Lyon > Lyon (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > Canada (0.04)
- (2 more...)
Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification
Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification leveraging CLIP, a powerful vision-language model. Despite CLIP's proficiency, it suffers from view-dependent predictions and inherent bias, limiting its effectiveness. We propose a novel method that addresses these issues by leveraging multiple views near target objects, guided by Class Activation Mapping (CAM) of the classifier, and debiasing pseudo-labels derived from CLIP predictions. Our Classifier-guided CLIP Distillation (CCD) enables selecting multiple local views without extra labels and debiasing predictions to enhance classification performance. Experimental results validate our method's superiority over existing techniques across diverse datasets. The code is available at https://github.com/k0u-id/CCD.
- Leisure & Entertainment > Sports (0.68)
- Transportation (0.47)